Project: Identify Customer Segments¶

In this project, we will apply unsupervised learning techniques to identify segments of the population that form the core customer base for a mail-order sales company in Germany. These segments can then be used to direct marketing campaigns towards audiences that will have the highest expected rate of return. The data that we will use has been provided by Bertelsmann Arvato Analytics, and represents a real-life data science task.

At the end of most sections, there will be a Markdown cell labeled Discussion. In these cells, we will report our findings for the completed section, as well as document the decisions we made in our approach to each subtask.

In [ ]:
# import libraries
from IPython.display import display

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# magic word for producing visualizations in notebook
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.rcParams['figure.figsize'] = [16, 8]
plt.rcParams['figure.dpi'] = 200

Step 0: Load the Data¶

There are four files associated with this project (not including this one):

  • Udacity_AZDIAS_Subset.csv: Demographics data for the general population of Germany; 891221 persons (rows) x 85 features (columns).
  • Udacity_CUSTOMERS_Subset.csv: Demographics data for customers of a mail-order company; 191652 persons (rows) x 85 features (columns).
  • Data_Dictionary.md: Detailed information file about the features in the provided datasets.
  • AZDIAS_Feature_Summary.csv: Summary of feature attributes for demographics data; 85 features (rows) x 4 columns

Each row of the demographics files represents a single person, but also includes information beyond the individual, such as information about their household, building, and neighborhood. We will use this information to cluster the general population into groups with similar demographic properties. Then, we will see how the people in the customers dataset fit into those clusters. The hope here is that certain clusters are over-represented in the customers data, as compared to the general population; those over-represented clusters will be assumed to be part of the core user base. This information can then be used for further applications, such as targeting for a marketing campaign.
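As a minimal sketch of that final comparison (using hypothetical label arrays and a helper function, not the project's actual pipeline): once both datasets have been assigned cluster labels, over-representation can be read off by comparing each cluster's share of customers against its share of the general population.

```python
import numpy as np

# Placeholder label arrays standing in for the output of a fitted
# clustering model applied to each dataset.
population_labels = np.array([0, 0, 1, 2, 2, 2, 1, 0])
customer_labels = np.array([2, 2, 2, 1, 2, 0])

def cluster_shares(labels, n_clusters):
    # Fraction of points assigned to each cluster.
    counts = np.bincount(labels, minlength=n_clusters)
    return counts / counts.sum()

n_clusters = 3
pop_share = cluster_shares(population_labels, n_clusters)
cust_share = cluster_shares(customer_labels, n_clusters)

# A ratio above 1 means the cluster is over-represented among customers.
ratio = cust_share / pop_share
```

Here cluster 2 holds a larger share of the (toy) customers than of the general population, so its ratio exceeds 1 and it would be flagged as part of the core user base.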

To start off, we load the demographics data for the general population into a pandas DataFrame, and do the same for the feature attributes summary. Note for all of the .csv data files in this project: they're semicolon (;) delimited, so we need the sep=';' argument in our read_csv() calls to read in the data properly.

In [ ]:
# Load in the general demographics data.
azdias = pd.read_csv('dataset/Udacity_AZDIAS_Subset.csv', sep=';')

# Load in the feature summary file.
feat_info = pd.read_csv('dataset/AZDIAS_Feature_Summary.csv', sep=';')
In [ ]:
print(f'Shape of demographics dataset : {azdias.shape}\n')
Shape of demographics dataset : (891221, 85)

In [ ]:
azdias.head(10)
Out[ ]:
AGER_TYP ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER ... PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB
0 -1 2 1 2.0 3 4 3 5 5 3 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 -1 1 2 5.0 1 5 2 5 4 5 ... 2.0 3.0 2.0 1.0 1.0 5.0 4.0 3.0 5.0 4.0
2 -1 3 2 3.0 1 4 1 2 3 5 ... 3.0 3.0 1.0 0.0 1.0 4.0 4.0 3.0 5.0 2.0
3 2 4 2 2.0 4 2 5 2 1 2 ... 2.0 2.0 2.0 0.0 1.0 3.0 4.0 2.0 3.0 3.0
4 -1 3 1 5.0 4 3 4 1 3 2 ... 2.0 4.0 2.0 1.0 2.0 3.0 3.0 4.0 6.0 5.0
5 3 1 2 2.0 3 1 5 2 2 5 ... 2.0 3.0 1.0 1.0 1.0 5.0 5.0 2.0 3.0 3.0
6 -1 2 2 5.0 1 5 1 5 4 3 ... 3.0 3.0 1.0 0.0 1.0 5.0 5.0 4.0 6.0 3.0
7 -1 1 1 3.0 3 3 4 1 3 2 ... 3.0 3.0 1.0 0.0 1.0 4.0 4.0 2.0 5.0 2.0
8 -1 3 1 3.0 4 4 2 4 2 2 ... 2.0 3.0 2.0 1.0 1.0 3.0 3.0 2.0 4.0 3.0
9 -1 3 2 4.0 2 4 2 3 5 4 ... 2.0 3.0 2.0 1.0 1.0 3.0 3.0 2.0 3.0 1.0

10 rows × 85 columns

In [ ]:
azdias.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891221 entries, 0 to 891220
Data columns (total 85 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   AGER_TYP               891221 non-null  int64  
 1   ALTERSKATEGORIE_GROB   891221 non-null  int64  
 2   ANREDE_KZ              891221 non-null  int64  
 3   CJT_GESAMTTYP          886367 non-null  float64
 4   FINANZ_MINIMALIST      891221 non-null  int64  
 5   FINANZ_SPARER          891221 non-null  int64  
 6   FINANZ_VORSORGER       891221 non-null  int64  
 7   FINANZ_ANLEGER         891221 non-null  int64  
 8   FINANZ_UNAUFFAELLIGER  891221 non-null  int64  
 9   FINANZ_HAUSBAUER       891221 non-null  int64  
 10  FINANZTYP              891221 non-null  int64  
 11  GEBURTSJAHR            891221 non-null  int64  
 12  GFK_URLAUBERTYP        886367 non-null  float64
 13  GREEN_AVANTGARDE       891221 non-null  int64  
 14  HEALTH_TYP             891221 non-null  int64  
 15  LP_LEBENSPHASE_FEIN    886367 non-null  float64
 16  LP_LEBENSPHASE_GROB    886367 non-null  float64
 17  LP_FAMILIE_FEIN        886367 non-null  float64
 18  LP_FAMILIE_GROB        886367 non-null  float64
 19  LP_STATUS_FEIN         886367 non-null  float64
 20  LP_STATUS_GROB         886367 non-null  float64
 21  NATIONALITAET_KZ       891221 non-null  int64  
 22  PRAEGENDE_JUGENDJAHRE  891221 non-null  int64  
 23  RETOURTYP_BK_S         886367 non-null  float64
 24  SEMIO_SOZ              891221 non-null  int64  
 25  SEMIO_FAM              891221 non-null  int64  
 26  SEMIO_REL              891221 non-null  int64  
 27  SEMIO_MAT              891221 non-null  int64  
 28  SEMIO_VERT             891221 non-null  int64  
 29  SEMIO_LUST             891221 non-null  int64  
 30  SEMIO_ERL              891221 non-null  int64  
 31  SEMIO_KULT             891221 non-null  int64  
 32  SEMIO_RAT              891221 non-null  int64  
 33  SEMIO_KRIT             891221 non-null  int64  
 34  SEMIO_DOM              891221 non-null  int64  
 35  SEMIO_KAEM             891221 non-null  int64  
 36  SEMIO_PFLICHT          891221 non-null  int64  
 37  SEMIO_TRADV            891221 non-null  int64  
 38  SHOPPER_TYP            891221 non-null  int64  
 39  SOHO_KZ                817722 non-null  float64
 40  TITEL_KZ               817722 non-null  float64
 41  VERS_TYP               891221 non-null  int64  
 42  ZABEOTYP               891221 non-null  int64  
 43  ALTER_HH               817722 non-null  float64
 44  ANZ_PERSONEN           817722 non-null  float64
 45  ANZ_TITEL              817722 non-null  float64
 46  HH_EINKOMMEN_SCORE     872873 non-null  float64
 47  KK_KUNDENTYP           306609 non-null  float64
 48  W_KEIT_KIND_HH         783619 non-null  float64
 49  WOHNDAUER_2008         817722 non-null  float64
 50  ANZ_HAUSHALTE_AKTIV    798073 non-null  float64
 51  ANZ_HH_TITEL           794213 non-null  float64
 52  GEBAEUDETYP            798073 non-null  float64
 53  KONSUMNAEHE            817252 non-null  float64
 54  MIN_GEBAEUDEJAHR       798073 non-null  float64
 55  OST_WEST_KZ            798073 non-null  object 
 56  WOHNLAGE               798073 non-null  float64
 57  CAMEO_DEUG_2015        792242 non-null  object 
 58  CAMEO_DEU_2015         792242 non-null  object 
 59  CAMEO_INTL_2015        792242 non-null  object 
 60  KBA05_ANTG1            757897 non-null  float64
 61  KBA05_ANTG2            757897 non-null  float64
 62  KBA05_ANTG3            757897 non-null  float64
 63  KBA05_ANTG4            757897 non-null  float64
 64  KBA05_BAUMAX           757897 non-null  float64
 65  KBA05_GBZ              757897 non-null  float64
 66  BALLRAUM               797481 non-null  float64
 67  EWDICHTE               797481 non-null  float64
 68  INNENSTADT             797481 non-null  float64
 69  GEBAEUDETYP_RASTER     798066 non-null  float64
 70  KKK                    770025 non-null  float64
 71  MOBI_REGIO             757897 non-null  float64
 72  ONLINE_AFFINITAET      886367 non-null  float64
 73  REGIOTYP               770025 non-null  float64
 74  KBA13_ANZAHL_PKW       785421 non-null  float64
 75  PLZ8_ANTG1             774706 non-null  float64
 76  PLZ8_ANTG2             774706 non-null  float64
 77  PLZ8_ANTG3             774706 non-null  float64
 78  PLZ8_ANTG4             774706 non-null  float64
 79  PLZ8_BAUMAX            774706 non-null  float64
 80  PLZ8_HHZ               774706 non-null  float64
 81  PLZ8_GBZ               774706 non-null  float64
 82  ARBEIT                 794005 non-null  float64
 83  ORTSGR_KLS9            794005 non-null  float64
 84  RELAT_AB               794005 non-null  float64
dtypes: float64(49), int64(32), object(4)
memory usage: 578.0+ MB
In [ ]:
azdias.describe()
Out[ ]:
AGER_TYP ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER ... PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB
count 891221.000000 891221.000000 891221.000000 886367.000000 891221.000000 891221.000000 891221.000000 891221.000000 891221.000000 891221.000000 ... 774706.000000 774706.000000 774706.000000 774706.000000 774706.000000 774706.000000 774706.000000 794005.000000 794005.000000 794005.00000
mean -0.358435 2.777398 1.522098 3.632838 3.074528 2.821039 3.401106 3.033328 2.874167 3.075121 ... 2.253330 2.801858 1.595426 0.699166 1.943913 3.612821 3.381087 3.167854 5.293002 3.07222
std 1.198724 1.068775 0.499512 1.595021 1.321055 1.464749 1.322134 1.529603 1.486731 1.353248 ... 0.972008 0.920309 0.986736 0.727137 1.459654 0.973967 1.111598 1.002376 2.303739 1.36298
min -1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 0.000000 1.00000
25% -1.000000 2.000000 1.000000 2.000000 2.000000 1.000000 3.000000 2.000000 2.000000 2.000000 ... 1.000000 2.000000 1.000000 0.000000 1.000000 3.000000 3.000000 3.000000 4.000000 2.00000
50% -1.000000 3.000000 2.000000 4.000000 3.000000 3.000000 3.000000 3.000000 3.000000 3.000000 ... 2.000000 3.000000 2.000000 1.000000 1.000000 4.000000 3.000000 3.000000 5.000000 3.00000
75% -1.000000 4.000000 2.000000 5.000000 4.000000 4.000000 5.000000 5.000000 4.000000 4.000000 ... 3.000000 3.000000 2.000000 1.000000 3.000000 4.000000 4.000000 4.000000 7.000000 4.00000
max 3.000000 9.000000 2.000000 6.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 ... 4.000000 4.000000 3.000000 2.000000 5.000000 5.000000 5.000000 9.000000 9.000000 9.00000

8 rows × 81 columns

In [ ]:
print(f'Shape of features attributes for the demographics data : {feat_info.shape}\n')
Shape of features attributes for the demographics data : (85, 4)

In [ ]:
feat_info
Out[ ]:
attribute information_level type missing_or_unknown
0 AGER_TYP person categorical [-1,0]
1 ALTERSKATEGORIE_GROB person ordinal [-1,0,9]
2 ANREDE_KZ person categorical [-1,0]
3 CJT_GESAMTTYP person categorical [0]
4 FINANZ_MINIMALIST person ordinal [-1]
... ... ... ... ...
80 PLZ8_HHZ macrocell_plz8 ordinal [-1]
81 PLZ8_GBZ macrocell_plz8 ordinal [-1]
82 ARBEIT community ordinal [-1,9]
83 ORTSGR_KLS9 community ordinal [-1,0]
84 RELAT_AB community ordinal [-1,9]

85 rows × 4 columns

In [ ]:
feat_info.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85 entries, 0 to 84
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   attribute           85 non-null     object
 1   information_level   85 non-null     object
 2   type                85 non-null     object
 3   missing_or_unknown  85 non-null     object
dtypes: object(4)
memory usage: 2.8+ KB
In [ ]:
feat_info.describe()
Out[ ]:
attribute information_level type missing_or_unknown
count 85 85 85 85
unique 85 9 5 9
top AGER_TYP person ordinal [-1]
freq 1 43 49 26

Step 1: Preprocessing¶

Step 1.1: Assess Missing Data¶

The feature summary file contains a summary of properties for each demographics data column. We will use this file to help us make cleaning decisions during this stage of the project. First of all, we should assess the demographics data in terms of missing data.

Step 1.1.1: Convert Missing Value Codes to NaNs¶

The fourth column of the feature attributes summary (loaded in above as feat_info) documents the codes from the data dictionary that indicate missing or unknown data. While the file encodes this as a list (e.g. [-1,0]), it is read in as a string, so we need a little parsing before we can use it to identify and clean the data. We convert every value that matches a 'missing' or 'unknown' code into a NumPy NaN value. As a point of interest, we can also compare how much data takes on a 'missing' or 'unknown' code against how much is naturally missing.

In [ ]:
def convert_number_array(string_list):
    # Split the string into a list of values, remove leading/trailing whitespace
    number_list = [number.strip() for number in string_list[1:-1].split(',')]
    
    # Check if the list contains a single empty string, return None in that case
    if len(number_list) == 1 and number_list[0] == '':
        return None
    
    # Leave the non-numeric codes 'X' and 'XX' as strings; convert everything else to integers
    number_list = [int(number) if number not in {'X', 'XX'} else number for number in number_list]
    
    return number_list
In [ ]:
# Create a dictionary mapping attribute names to missing or unknown values
feat_info_dict = dict(zip(feat_info['attribute'], feat_info['missing_or_unknown']))
feat_info_dict = {key: convert_number_array(value) for key, value in feat_info_dict.items()}

# Copy the original DataFrame
cleaned_azdias = azdias.copy()

# Replace missing or unknown value codes with NaN
# (column reassignment rather than an inplace replace on the column,
# which newer pandas versions warn about)
for attribute, missing_values in feat_info_dict.items():
    if missing_values is not None:
        cleaned_azdias[attribute] = cleaned_azdias[attribute].replace(missing_values, np.nan)

Step 1.1.2: Assess Missing Data in Each Column¶

How much missing data is present in each column? There are a few columns that are outliers in terms of the proportion of values that are missing. We will use matplotlib's bar() function to visualize the distribution of missing value counts to find these columns.

In [ ]:
# Perform an assessment of how much missing data there is in each column of the dataset.
print(f'Total number of NaNs in dataset before conversion : {azdias.isna().sum().sum()}\n\n')

azdias.isna().sum()
Total number of NaNs in dataset before conversion : 4896838


Out[ ]:
AGER_TYP                     0
ALTERSKATEGORIE_GROB         0
ANREDE_KZ                    0
CJT_GESAMTTYP             4854
FINANZ_MINIMALIST            0
                         ...  
PLZ8_HHZ                116515
PLZ8_GBZ                116515
ARBEIT                   97216
ORTSGR_KLS9              97216
RELAT_AB                 97216
Length: 85, dtype: int64
In [ ]:
print(f'Total number of NaNs in dataset after conversion : {cleaned_azdias.isna().sum().sum()}')
cleaned_values = cleaned_azdias.isna().sum().sort_values(ascending=False)

# Keep only the features whose missing-value count exceeds 70% of the mean count
cleaned_values = cleaned_values[cleaned_values > 0.7 * cleaned_values.mean()]
Total number of NaNs in dataset after conversion : 8373929
In [ ]:
cleaned_azdias.head(20)
Out[ ]:
AGER_TYP ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER ... PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB
0 NaN 2.0 1 2.0 3 4 3 5 5 3 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN 1.0 2 5.0 1 5 2 5 4 5 ... 2.0 3.0 2.0 1.0 1.0 5.0 4.0 3.0 5.0 4.0
2 NaN 3.0 2 3.0 1 4 1 2 3 5 ... 3.0 3.0 1.0 0.0 1.0 4.0 4.0 3.0 5.0 2.0
3 2.0 4.0 2 2.0 4 2 5 2 1 2 ... 2.0 2.0 2.0 0.0 1.0 3.0 4.0 2.0 3.0 3.0
4 NaN 3.0 1 5.0 4 3 4 1 3 2 ... 2.0 4.0 2.0 1.0 2.0 3.0 3.0 4.0 6.0 5.0
5 3.0 1.0 2 2.0 3 1 5 2 2 5 ... 2.0 3.0 1.0 1.0 1.0 5.0 5.0 2.0 3.0 3.0
6 NaN 2.0 2 5.0 1 5 1 5 4 3 ... 3.0 3.0 1.0 0.0 1.0 5.0 5.0 4.0 6.0 3.0
7 NaN 1.0 1 3.0 3 3 4 1 3 2 ... 3.0 3.0 1.0 0.0 1.0 4.0 4.0 2.0 5.0 2.0
8 NaN 3.0 1 3.0 4 4 2 4 2 2 ... 2.0 3.0 2.0 1.0 1.0 3.0 3.0 2.0 4.0 3.0
9 NaN 3.0 2 4.0 2 4 2 3 5 4 ... 2.0 3.0 2.0 1.0 1.0 3.0 3.0 2.0 3.0 1.0
10 NaN 3.0 2 1.0 2 2 5 3 1 5 ... 2.0 4.0 2.0 0.0 2.0 3.0 3.0 4.0 6.0 5.0
11 NaN 2.0 1 6.0 3 4 3 5 5 3 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
12 NaN 3.0 1 6.0 5 3 4 2 4 1 ... 3.0 3.0 1.0 0.0 1.0 5.0 5.0 3.0 6.0 4.0
13 NaN 1.0 2 5.0 1 4 3 5 5 2 ... 2.0 1.0 1.0 1.0 1.0 3.0 3.0 3.0 6.0 4.0
14 NaN 3.0 1 6.0 3 4 3 5 5 3 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
15 1.0 4.0 2 4.0 4 1 5 1 1 4 ... NaN NaN NaN NaN NaN NaN NaN 4.0 8.0 5.0
16 NaN 1.0 2 1.0 4 3 1 4 5 1 ... 3.0 3.0 1.0 0.0 1.0 3.0 4.0 1.0 2.0 1.0
17 NaN 2.0 1 6.0 3 4 3 5 5 3 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
18 NaN 2.0 2 6.0 2 4 1 5 4 1 ... 2.0 3.0 2.0 1.0 1.0 3.0 3.0 3.0 4.0 3.0
19 NaN 3.0 1 3.0 5 2 3 1 3 1 ... 2.0 4.0 2.0 1.0 2.0 5.0 4.0 4.0 6.0 3.0

20 rows × 85 columns

In [ ]:
# Investigate patterns in the amount of missing data in each column.
fig, ax = plt.subplots()
ax.bar(cleaned_values.index, cleaned_values)
plt.xticks(rotation=90)
fig.tight_layout()
ax.set_xlabel('Feature', fontsize=15)
ax.set_ylabel('Count', fontsize=15)
ax.set_title('Number of missing values for each feature', fontsize=20)
plt.show()
In [ ]:
# Remove the outlier columns from the dataset.
outliers = ['TITEL_KZ', 'AGER_TYP', 'KK_KUNDENTYP', 'KBA05_BAUMAX', 'GEBURTSJAHR', 'ALTER_HH']
cleaned_azdias.drop(outliers, axis=1, inplace=True)
cleaned_feat_info = feat_info[~feat_info['attribute'].isin(outliers)].reset_index(drop=True)
In [ ]:
print(f'Shape of dataset after dropping outliers : {cleaned_azdias.shape}\n')
print(f'Shape of feature information : {cleaned_feat_info.shape}')
Shape of dataset after dropping outliers : (891221, 79)

Shape of feature information : (79, 4)

Discussion 1.1.2: Assess Missing Data in Each Column¶

Before parsing our feature summary file, the dataset contained 4,896,838 values that pandas recognized as missing. After converting the 'missing' and 'unknown' codes with the help of the feature summary file, we ended up with more than 8 million missing values (NaN).

The outlier features that were dropped from our dataset are:

  • TITEL_KZ
  • AGER_TYP
  • KK_KUNDENTYP
  • KBA05_BAUMAX
  • GEBURTSJAHR
  • ALTER_HH

Step 1.1.3: Assess Missing Data in Each Row¶

Now, we'll perform a similar assessment for the rows of the dataset. How much data is missing in each row? As with the columns, we should see some groups of points that have very different numbers of missing values. We divide the data into two subsets: one for data points that are above some threshold for missing values, and a second subset for points below that threshold.

Depending on what we observe in this comparison, there will be implications for how we approach our conclusions later in the analysis. If the distributions of non-missing features look similar between the data with many missing values and the data with few or no missing values, then we could argue that simply dropping those points from the analysis won't present a major issue. On the other hand, if the data with many missing values looks very different from the data with few or no missing values, then we should make a note of those data as special. Either way, we will continue our analysis for now using just the subset of the data with few or no missing values.

In [ ]:
# How much data is missing in each row of the dataset?
row_nan_count = cleaned_azdias.isna().sum(axis=1)
plt.style.use("seaborn-v0_8-whitegrid")
plt.hist(row_nan_count, facecolor = '#2ab0ff', edgecolor='#e0e0e0', linewidth=0.5)
plt.title('Rows\' Missing Data', fontsize=20)
plt.xlabel('NaN Count', fontsize=15)
plt.ylabel('Row Frequency', fontsize=15)
plt.tight_layout()
plt.show()
In [ ]:
# Divide the data into two subsets based on the number of missing values in each row.

# Threshold chosen is 25

threshold = 25

# .copy() gives each subset its own data and avoids pandas'
# SettingWithCopyWarning on later modifications.
azdias_above_thresh = cleaned_azdias[row_nan_count > threshold].copy()
azdias_below_thresh = cleaned_azdias[row_nan_count <= threshold].copy()
azdias_above_thresh.reset_index(drop=True, inplace=True)
azdias_below_thresh.reset_index(drop=True, inplace=True)

print(f'Shape of data subset with missing data above threshold = {azdias_above_thresh.shape}')
print(f'Shape of data subset with missing data below (and equal) threshold = {azdias_below_thresh.shape}')
Shape of data subset with missing data above threshold = (93260, 79)
Shape of data subset with missing data below (and equal) threshold = (797961, 79)
In [ ]:
cleaned_azdias.isna().sum().sort_values().head(30)
Out[ ]:
ZABEOTYP                    0
SEMIO_REL                   0
SEMIO_MAT                   0
SEMIO_VERT                  0
SEMIO_LUST                  0
SEMIO_ERL                   0
SEMIO_KULT                  0
SEMIO_RAT                   0
SEMIO_KRIT                  0
SEMIO_DOM                   0
SEMIO_KAEM                  0
GREEN_AVANTGARDE            0
SEMIO_PFLICHT               0
FINANZTYP                   0
FINANZ_HAUSBAUER            0
FINANZ_UNAUFFAELLIGER       0
FINANZ_ANLEGER              0
FINANZ_VORSORGER            0
FINANZ_SPARER               0
FINANZ_MINIMALIST           0
SEMIO_TRADV                 0
ANREDE_KZ                   0
SEMIO_SOZ                   0
SEMIO_FAM                   0
ALTERSKATEGORIE_GROB     2881
LP_STATUS_GROB           4854
LP_STATUS_FEIN           4854
ONLINE_AFFINITAET        4854
RETOURTYP_BK_S           4854
GFK_URLAUBERTYP          4854
dtype: int64
In [ ]:
def compare_column_dist(data_1, data_2, data_1_annotation, data_2_annotation, 
                        title, column):
    
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 6))
    plt.style.use("seaborn-v0_8-whitegrid")
    y_limit = max(data_1[column].value_counts().iloc[0], data_2[column].value_counts().iloc[0]) * 1.1

    #sns.countplot(data_1[column], ax=ax1)
    ax1.hist(data_1[column])
    ax1.set_title(data_1_annotation)
    ax1.set_ylabel('Frequency')
    ax1.set_ylim(top=y_limit)

    #sns.countplot(data_2[column], ax=ax2)
    ax2.hist(data_2[column])
    ax2.set_title(data_2_annotation)
    ax2.set_ylabel('Frequency')
    ax2.set_ylim(top=y_limit)
    
    fig.suptitle(title, fontsize=20)
    plt.tight_layout()
    plt.show()
In [ ]:
# Compare the distribution of values for at least five columns where there are
# no missing values, between the two subsets.

# No missing values 
annotate_1 = "Rows Above Threshold (25) NaN Values"
annotate_2 = "Rows Below Threshold (25) NaN Values"
first_comparison_title = 'Compare Distribution with Columns with NO missing values'
for column in ['ZABEOTYP', 'SEMIO_DOM', 'FINANZ_ANLEGER', 'FINANZ_SPARER', 'GREEN_AVANTGARDE']:
    compare_column_dist(azdias_above_thresh, azdias_below_thresh, 
                        annotate_1, annotate_2, first_comparison_title, column)
In [ ]:
# Few missing values comparison
second_comparison_title = 'Compare Distribution between Columns with FEW missing values'
for column in ['ALTERSKATEGORIE_GROB', 'ONLINE_AFFINITAET', 'GFK_URLAUBERTYP']:
    compare_column_dist(azdias_above_thresh, azdias_below_thresh, 
                        annotate_1, annotate_2, second_comparison_title, column)

Discussion 1.1.3: Assess Missing Data in Each Row¶

We chose a threshold of 25 missing values per row to subset our data, and we found that the rows with lots of missing values are qualitatively different from the rows with no or few missing values.

We will mark those rows as special and set them aside, rather than simply dropping them from the dataset.

Step 1.2: Remove Highly Correlated Features¶

Before applying Principal Component Analysis (PCA), we perform a crucial preprocessing step by identifying and removing highly correlated features from the dataset. This process helps reduce multicollinearity, improve interpretability, enhance stability, and make the subsequent PCA analysis more computationally efficient. By eliminating redundant information, we ensure that the resulting principal components capture the most meaningful and unique aspects of the data, ultimately leading to more reliable and insightful dimensionality reduction.

In [ ]:
# Calculate the correlation matrix for numeric columns
corr_matrix = azdias_below_thresh.corr(numeric_only=True).abs()

# Create an upper triangular mask
upper_triangle_mask = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Use the mask to find highly correlated columns
columns_to_drop = [column for column in upper_triangle_mask.columns if any(upper_triangle_mask[column] > 0.95)]

# Create the correlation heatmap
plt.figure(figsize=(12, 10))
heatmap = sns.heatmap(azdias_below_thresh[columns_to_drop].corr(), vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Highly Correlated Features\n(Threshold = 0.95)', fontdict={'fontsize': 18}, pad=12)
plt.show()
In [ ]:
# Drop the highly correlated columns from the DataFrame.
# We reassign rather than drop inplace: azdias_below_thresh was created by
# slicing another DataFrame, and modifying it inplace triggers pandas'
# SettingWithCopyWarning.
azdias_below_thresh = azdias_below_thresh.drop(columns=columns_to_drop).reset_index(drop=True)

# Update our feature information
cleaned_feat_info = cleaned_feat_info[~cleaned_feat_info['attribute'].isin(columns_to_drop)].reset_index(drop=True)

Step 1.3: Select and Re-Encode Features¶

Checking for missing data isn't the only way in which we can prepare a dataset for analysis. Since the unsupervised learning techniques to be used will only work on data that is encoded numerically, we need to make a few encoding changes or additional assumptions to be able to make progress. In addition, while almost all of the values in the dataset are encoded using numbers, not all of them represent numeric values. Check the third column of the feature summary (feat_info) for a summary of types of measurement.

  • For numeric and interval data, these features can be kept without changes.
  • Most of the variables in the dataset are ordinal in nature. While ordinal values may technically be non-linear in spacing, make the simplifying assumption that the ordinal variables can be treated as being interval in nature (that is, kept without any changes).
  • Special handling may be necessary for the remaining two variable types: categorical, and 'mixed'.

In the first two parts of this sub-step, we will perform an investigation of the categorical and mixed-type features and decide for each of them whether we will keep, drop, or re-encode it. Then, in the last part, we will create a new DataFrame with only the selected and engineered columns.

In [ ]:
# Number of features of each data type?
cleaned_feat_info['type'].value_counts()
Out[ ]:
ordinal        49
categorical    15
numeric         6
mixed           5
Name: type, dtype: int64
In [ ]:
# Checking ordinal and numeric values sanity

ord_num_feats = cleaned_feat_info.loc[cleaned_feat_info['type'].isin(['ordinal', 'numeric']), 'attribute']
ord_num_values = [azdias_below_thresh[column].dropna().unique() for column in ord_num_feats]
for value, feat in zip(ord_num_values, ord_num_feats):
    print(f'Attribute: {feat} with unique values: {value}')
Attribute: ALTERSKATEGORIE_GROB with unique values: [1. 3. 4. 2.]
Attribute: FINANZ_MINIMALIST with unique values: [1 4 3 2 5]
Attribute: FINANZ_SPARER with unique values: [5 4 2 3 1]
Attribute: FINANZ_VORSORGER with unique values: [2 1 5 4 3]
Attribute: FINANZ_ANLEGER with unique values: [5 2 1 4 3]
Attribute: FINANZ_UNAUFFAELLIGER with unique values: [4 3 1 2 5]
Attribute: FINANZ_HAUSBAUER with unique values: [5 2 3 4 1]
Attribute: HEALTH_TYP with unique values: [3. 2. 1.]
Attribute: RETOURTYP_BK_S with unique values: [1. 3. 2. 5. 4.]
Attribute: SEMIO_SOZ with unique values: [5 4 6 2 7 3 1]
Attribute: SEMIO_FAM with unique values: [4 1 5 7 2 6 3]
Attribute: SEMIO_REL with unique values: [4 3 2 7 5 1 6]
Attribute: SEMIO_MAT with unique values: [3 1 2 4 7 5 6]
Attribute: SEMIO_VERT with unique values: [1 4 7 2 6 5 3]
Attribute: SEMIO_LUST with unique values: [2 4 6 7 3 1 5]
Attribute: SEMIO_ERL with unique values: [2 6 7 4 5 1 3]
Attribute: SEMIO_KULT with unique values: [3 4 6 5 7 1 2]
Attribute: SEMIO_RAT with unique values: [6 4 3 2 7 5 1]
Attribute: SEMIO_KRIT with unique values: [4 7 3 1 5 6 2]
Attribute: SEMIO_DOM with unique values: [7 4 2 1 5 6 3]
Attribute: SEMIO_KAEM with unique values: [4 7 5 2 3 6 1]
Attribute: SEMIO_PFLICHT with unique values: [7 3 4 5 1 6 2]
Attribute: SEMIO_TRADV with unique values: [6 3 4 2 7 5 1]
Attribute: ANZ_PERSONEN with unique values: [ 2.  1.  0.  4.  3.  5.  6.  7.  8. 12.  9. 21. 10. 13. 11. 14. 45. 20.
 31. 37. 16. 22. 15. 23. 18. 35. 17. 40. 38.]
Attribute: ANZ_TITEL with unique values: [0. 1. 2. 4. 3. 6.]
Attribute: HH_EINKOMMEN_SCORE with unique values: [6. 4. 1. 5. 3. 2.]
Attribute: W_KEIT_KIND_HH with unique values: [3. 2. 6. 5. 4. 1.]
Attribute: WOHNDAUER_2008 with unique values: [9. 8. 3. 4. 5. 6. 2. 7. 1.]
Attribute: ANZ_HAUSHALTE_AKTIV with unique values: [ 11.  10.   1.   3.   5.   4.   6.   2.   9.  14.  13.   8.   7.  44.
  25.  36.  16. 272.  33.  15.  12.  22.  21.  20.  67.  23.  18.  27.
  24.  31.  39.  19.  30.  17.  29.  61.  54.  88.  49.  60.  38.  34.
  48.  35.  62.  26.  45.  42.  28.  77.  47.  50. 236. 137.  43.  52.
  41.  76.  63.  32.  40.  37.  59.  53.  57. 118.  66.  95.  78.  70.
  46. 185.  73. 154.  80. 102.  92. 126.  94.  71.  68.  81.  69. 121.
 258.  58.  86. 124.  85.  90. 128.  51.  72.  64.  56.  83. 150. 259.
  91. 160. 134. 169. 246.  82. 114. 183. 116.  74. 143.  99. 216.  98.
 125. 112. 123.  55. 120.  75. 141. 103.  87.  89. 223.  93. 179. 135.
 149.  97. 101. 110.  79. 111. 109.  65.  84. 367. 162. 195. 107. 119.
 331. 122. 161. 157. 175. 177. 136. 104. 316. 127. 100. 180. 155. 133.
 178. 176. 108. 353. 131. 130.  96. 256. 117. 146. 211. 199. 140. 200.
 106. 285. 263. 314. 253. 163. 377. 171. 148. 191. 243. 153. 145. 174.
 172. 190. 142. 317. 244. 189. 286. 105. 170. 215. 321. 113. 115. 156.
 129. 242. 132. 280. 159. 138. 158. 267. 151. 164. 187. 206. 266. 265.
 209. 196. 139. 193. 184. 210. 260. 366. 194. 318. 344. 277. 595. 197.
 247. 165. 283. 326. 181. 186. 438. 222. 144. 378. 237. 152. 208. 274.
 430. 202. 214. 231. 219. 225. 230. 198. 224. 333. 168. 221. 523. 254.
 305. 166. 252. 445. 147. 301. 213. 536. 220. 249. 290. 515. 276. 241.
 204. 250. 218. 226. 201. 240. 404. 348. 328. 228. 304. 167. 346. 205.
 207. 293. 255. 307. 414. 311. 347. 379. 395. 229. 232.]
Attribute: ANZ_HH_TITEL with unique values: [ 0.  1.  5.  2.  3.  7.  4.  6.  9. 15. 14.  8. 11. 10. 12. 13. 20. 16.
 17. 23. 18.]
Attribute: KONSUMNAEHE with unique values: [1. 5. 4. 3. 2. 6. 7.]
Attribute: MIN_GEBAEUDEJAHR with unique values: [1992. 1997. 2005. 2009. 1994. 1996. 2002. 2015. 1991. 1993. 1995. 2003.
 2008. 2006. 2000. 1990. 2004. 1999. 1998. 2001. 2007. 2013. 1989. 2014.
 2012. 2010. 1987. 2011. 1988. 1985. 2016. 1986.]
Attribute: KBA05_ANTG1 with unique values: [0. 1. 4. 2. 3.]
Attribute: KBA05_ANTG2 with unique values: [0. 3. 1. 4. 2.]
Attribute: KBA05_ANTG3 with unique values: [0. 1. 2. 3.]
Attribute: KBA05_ANTG4 with unique values: [2. 0. 1.]
Attribute: KBA05_GBZ with unique values: [1. 3. 4. 5. 2.]
Attribute: BALLRAUM with unique values: [6. 2. 4. 3. 7. 1. 5.]
Attribute: EWDICHTE with unique values: [3. 4. 2. 5. 6. 1.]
Attribute: INNENSTADT with unique values: [8. 4. 6. 1. 7. 3. 2. 5.]
Attribute: GEBAEUDETYP_RASTER with unique values: [3. 4. 5. 1. 2.]
Attribute: KKK with unique values: [2. 3. 4. 1.]
Attribute: MOBI_REGIO with unique values: [1. 3. 4. 5. 2. 6.]
Attribute: ONLINE_AFFINITAET with unique values: [3. 2. 1. 5. 4. 0.]
Attribute: REGIOTYP with unique values: [3. 2. 5. 1. 7. 6. 4.]
Attribute: KBA13_ANZAHL_PKW with unique values: [963. 712. 596. ...   2.  30.   7.]
Attribute: PLZ8_ANTG1 with unique values: [2. 3. 1. 4. 0.]
Attribute: PLZ8_ANTG2 with unique values: [3. 2. 4. 1. 0.]
Attribute: PLZ8_ANTG3 with unique values: [2. 1. 3. 0.]
Attribute: PLZ8_ANTG4 with unique values: [1. 0. 2.]
Attribute: PLZ8_HHZ with unique values: [5. 4. 3. 2. 1.]
Attribute: PLZ8_GBZ with unique values: [4. 3. 5. 2. 1.]
Attribute: ARBEIT with unique values: [3. 2. 4. 1. 5.]
Attribute: ORTSGR_KLS9 with unique values: [5. 3. 6. 4. 8. 2. 7. 9. 1.]
Attribute: RELAT_AB with unique values: [4. 2. 3. 5. 1.]

Step 1.3.1: Re-Encode Categorical Features¶

For categorical data, we would ordinarily need to encode the levels as dummy variables. Depending on the number of categories, we perform one of the following:

  • For binary (two-level) categoricals that take numeric values, we can keep them as-is without needing to do anything.
  • There is one binary variable that takes on non-numeric values. For this one, we need to re-encode the values as numbers or create a dummy variable.
  • For multi-level categoricals (three or more values), we can either encode the values using multiple dummy variables (e.g. via pandas.get_dummies), or (to keep things straightforward) just drop them from the analysis. As always, we document our choices in the Discussion section.
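As a small illustration of these two cases (a toy frame with made-up values, not the project data), the binary re-encoding and dummy-variable approach look like this:

```python
import pandas as pd

# Toy frame: one non-numeric binary feature, one multi-level categorical
df = pd.DataFrame({'OST_WEST_KZ': ['W', 'O', 'W'],
                   'FINANZTYP': [1, 6, 5]})

# Binary non-numeric feature: map the two levels to 0/1 directly
df['OST_WEST_KZ'] = df['OST_WEST_KZ'].map({'W': 1, 'O': 0})

# Multi-level categorical: one dummy column per level
df = pd.get_dummies(df, columns=['FINANZTYP'])
print(df.columns.tolist())
# ['OST_WEST_KZ', 'FINANZTYP_1', 'FINANZTYP_5', 'FINANZTYP_6']
```

Note that `get_dummies` names the new columns `<feature>_<level>`, which is why the one-hot-encoded features later appear under names like `ZABEOTYP_3`.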
In [ ]:
# Assess categorical variables: which are binary, which are multi-level, and
# which one needs to be re-encoded?
categorical_feats = cleaned_feat_info[cleaned_feat_info['type'] == 'categorical']['attribute'].values.tolist()

encoded_columns = [feat for feat in categorical_feats if azdias_below_thresh[feat].nunique(dropna=True) > 2]

for feat in encoded_columns:
    unique_values = azdias_below_thresh[feat].dropna().unique()
    print(f'Feature: {feat} has unique values (excluding NaN): {unique_values}')
Feature: CJT_GESAMTTYP has unique values (excluding NaN): [5. 3. 2. 4. 1. 6.]
Feature: FINANZTYP has unique values (excluding NaN): [1 6 5 2 4 3]
Feature: GFK_URLAUBERTYP has unique values (excluding NaN): [10.  1.  5. 12.  9.  3.  8. 11.  4.  2.  7.  6.]
Feature: LP_STATUS_FEIN has unique values (excluding NaN): [ 2.  3.  9.  4.  1. 10.  5.  8.  6.  7.]
Feature: NATIONALITAET_KZ has unique values (excluding NaN): [1. 3. 2.]
Feature: SHOPPER_TYP has unique values (excluding NaN): [3. 2. 1. 0.]
Feature: ZABEOTYP has unique values (excluding NaN): [5 3 4 1 6 2]
Feature: GEBAEUDETYP has unique values (excluding NaN): [8. 1. 3. 2. 6. 4. 5.]
Feature: CAMEO_DEUG_2015 has unique values (excluding NaN): ['8' '4' '2' '6' '1' '9' '5' '7' '3']
Feature: CAMEO_DEU_2015 has unique values (excluding NaN): ['8A' '4C' '2A' '6B' '8C' '4A' '2D' '1A' '1E' '9D' '5C' '8B' '7A' '5D'
 '9E' '9B' '1B' '3D' '4E' '4B' '3C' '5A' '7B' '9A' '6D' '6E' '2C' '7C'
 '9C' '7D' '5E' '1D' '8D' '6C' '6A' '5B' '4D' '3A' '2B' '7E' '3B' '6F'
 '5F' '1C']
In [ ]:
# Re-encode categorical variable(s) to be kept in the analysis.

# First convert CAMEO_DEUG_2015 from strings to numeric values
azdias_below_thresh['CAMEO_DEUG_2015'] = pd.to_numeric(azdias_below_thresh['CAMEO_DEUG_2015'], errors='coerce')

# Manual Encoding of OST_WEST_KZ
azdias_below_thresh['OST_WEST_KZ'] = azdias_below_thresh['OST_WEST_KZ'].map({'W': 1, 'O': 0, np.nan: np.nan})

# OneHotEncoding of multi-level categoricals
azdias_below_thresh = pd.get_dummies(azdias_below_thresh, columns=encoded_columns)
C:\Users\Ziad\AppData\Local\Temp\ipykernel_6872\3067772421.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  azdias_below_thresh['CAMEO_DEUG_2015'] = pd.to_numeric(azdias_below_thresh['CAMEO_DEUG_2015'], errors='coerce')
C:\Users\Ziad\AppData\Local\Temp\ipykernel_6872\3067772421.py:7: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  azdias_below_thresh['OST_WEST_KZ'] = azdias_below_thresh['OST_WEST_KZ'].map({'W': 1, 'O': 0, np.nan: np.nan})

Discussion 1.3.1: Re-Encode Categorical Features¶

We first checked the ordinal and numerical features; all of them are already int (or float), which is fine for the analysis and algorithms that follow. Inspecting the categorical features showed that CAMEO_DEUG_2015 stores digits as strings, so it had to be converted to numeric before any further processing. OST_WEST_KZ is the one binary feature with non-numeric values, so it was manually re-encoded (W → 1, O → 0).

Finally, all categorical features with more than 2 unique values (excluding NaN) were one-hot encoded.

Step 1.3.2: Engineer Mixed-Type Features¶

There are a handful of features that are marked as "mixed" in the feature summary that require special treatment in order to be included in the analysis. There are two in particular that deserve attention; the handling of the rest are up to our own choices:

  • "PRAEGENDE_JUGENDJAHRE" combines information on three dimensions: generation by decade, movement (mainstream vs. avantgarde), and nation (east vs. west). While there aren't enough levels to disentangle east from west, we should create two new variables to capture the other two dimensions: an interval-type variable for decade, and a binary variable for movement.
  • "CAMEO_INTL_2015" combines information on two axes: wealth and life stage. Break up the two-digit codes by their 'tens'-place and 'ones'-place digits into two new ordinal variables (which, for the purposes of this project, is equivalent to just treating them as their raw numeric values).
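The digit split described in the second bullet can be sketched on a toy series of hypothetical two-digit codes:

```python
import pandas as pd

codes = pd.Series([43.0, 51.0, 12.0])   # hypothetical CAMEO_INTL_2015-style codes

wealth = codes // 10      # tens place -> wealth
life_stage = codes % 10   # ones place -> life stage

print(wealth.tolist())      # [4.0, 5.0, 1.0]
print(life_stage.tolist())  # [3.0, 1.0, 2.0]
```

Because `//` and `%` work element-wise on a numeric Series (NaN stays NaN), the same two lines apply directly to the full column once it has been converted to numeric.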
In [ ]:
mixed_feats = cleaned_feat_info[cleaned_feat_info['type'] == 'mixed']['attribute'].values.tolist()
print(mixed_feats)
['LP_LEBENSPHASE_FEIN', 'PRAEGENDE_JUGENDJAHRE', 'WOHNLAGE', 'CAMEO_INTL_2015', 'PLZ8_BAUMAX']
In [ ]:
# Investigate "PRAEGENDE_JUGENDJAHRE" and engineer two new variables.

# Define mappings for decade and movement
decade_mapping = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 3, 8: 4, 9: 4, 10: 5, 11: 5, 12: 6, 13: 6, 14: 7, 15: 7}
movement_mapping = {1: 1, 2: 0, 3: 1, 4: 0, 5: 1, 6: 0, 7: 0, 8: 1, 9: 1, 10: 0, 11: 0, 12: 1, 13: 1, 14: 0, 15: 0}

# Apply the mappings directly to create new columns
azdias_below_thresh['PRAEGENDE_JUGENDJAHRE__DECADE'] = azdias_below_thresh['PRAEGENDE_JUGENDJAHRE'].replace(decade_mapping)
azdias_below_thresh['PRAEGENDE_JUGENDJAHRE__MOVEMENT'] = azdias_below_thresh['PRAEGENDE_JUGENDJAHRE'].replace(movement_mapping)

# Drop the original 'PRAEGENDE_JUGENDJAHRE' feature
azdias_below_thresh.drop(columns=['PRAEGENDE_JUGENDJAHRE'], inplace=True)
mixed_feats.remove('PRAEGENDE_JUGENDJAHRE')
In [ ]:
# Investigate "CAMEO_INTL_2015" and engineer two new variables.

# Change feature "CAMEO_INTL_2015" to integers first
azdias_below_thresh['CAMEO_INTL_2015'] = pd.to_numeric(azdias_below_thresh['CAMEO_INTL_2015'], errors='coerce')

# Create a new column 'WEALTH' by extracting the 'tens'-place digit
azdias_below_thresh['CAMEO_INTL_2015__WEALTH'] = azdias_below_thresh['CAMEO_INTL_2015'] // 10

# Create a new column 'LIFE_STAGE' by extracting the 'ones'-place digit
azdias_below_thresh['CAMEO_INTL_2015__LIFE_STAGE'] = azdias_below_thresh['CAMEO_INTL_2015'] % 10

# Drop the original 'CAMEO_INTL_2015' feature
azdias_below_thresh.drop(columns=['CAMEO_INTL_2015'], inplace=True)
mixed_feats.remove('CAMEO_INTL_2015')

Discussion 1.3.2: Engineer Mixed-Type Features¶

We handled the two mixed features that required attention. In the next cell we will also engineer new features from WOHNLAGE, which describes neighborhood quality (plus a rural flag) and is useful for our algorithm. We will also one-hot encode PLZ8_BAUMAX, as it could help when measuring building type. The remaining feature LP_LEBENSPHASE_FEIN will be dropped, as it is just the fine-scale life stage, which we already derived earlier.

Step 1.3.3: Complete Feature Selection¶

In order to finish this step up, we need to make sure that our data frame now only has the columns that we want to keep. To summarize, the dataframe should consist of the following:

  • All numeric, interval, and ordinal type columns from the original dataset.
  • Binary categorical features (all numerically-encoded).
  • Engineered features from other multi-level categorical features and mixed features.

Make sure that for any new columns we have engineered, we've excluded the original columns from the final dataset. Otherwise, their values will interfere with the analysis later in the project. For example, we should not keep "PRAEGENDE_JUGENDJAHRE", since its raw values won't be useful for the algorithm: only the engineered features derived from it should be retained. As a reminder, our data should only come from the subset of rows with few or no missing values.

In [ ]:
# Extra feature engineering task

# OHE PLZ8_BAUMAX
azdias_below_thresh = pd.get_dummies(azdias_below_thresh, columns=['PLZ8_BAUMAX'])
mixed_feats.remove('PLZ8_BAUMAX')

# Create binary features for quality categories
quality_categories = [1, 2, 3, 4, 5]
for category in quality_categories:
    azdias_below_thresh[f'QUALITY_{category}'] = (azdias_below_thresh['WOHNLAGE'] == category).astype(int)

# Create a binary feature for rural vs. urban buildings
azdias_below_thresh['RURAL'] = ((azdias_below_thresh['WOHNLAGE'] == 7) | (azdias_below_thresh['WOHNLAGE'] == 8)).astype(int)

# Drop the original 'WOHNLAGE' feature
azdias_below_thresh.drop(columns=['WOHNLAGE'], inplace=True)
mixed_feats.remove('WOHNLAGE')
In [ ]:
# Drop remaining mixed features
azdias_below_thresh.drop(columns=mixed_feats, inplace=True)
azdias_below_thresh.reset_index(drop=True, inplace=True)

Step 1.4: Create a Cleaning Function¶

Even though we've finished cleaning up the general population demographics data, it's important to look ahead and realize that we'll need to perform the same cleaning steps on the customer demographics data. In this substep, we complete the function below to execute the main feature selection, encoding, and re-engineering steps performed above. Then, when it comes to looking at the customer data in Step 3, we can just run this function on that DataFrame to get the trimmed dataset in a single step.

In [ ]:
def clean_data(df):
    """
    Perform feature trimming, re-encoding, and engineering for demographics
    data
    
    INPUT: Demographics DataFrame
    OUTPUT: Trimmed and cleaned demographics DataFrame
    """
    df_new = df.copy()

    # Replace missing or unknown values with NaN
    for attribute, missing_values in feat_info_dict.items():
        if missing_values is not None:
            df_new[attribute].replace(missing_values, np.nan, inplace=True)
    
    # Remove the outlier columns from the dataset
    outliers = ['TITEL_KZ', 'AGER_TYP', 'KK_KUNDENTYP', 'KBA05_BAUMAX', 'GEBURTSJAHR', 'ALTER_HH']
    df_new.drop(outliers, axis=1, inplace=True)

    # Remove rows with too many missing values; compute the per-row NaN count
    # on this DataFrame rather than reusing a global Series from the earlier
    # analysis (which is indexed by the azdias rows and would misalign here)
    row_nan_count = df_new.isnull().sum(axis=1)
    threshold = 25
    df_new = df_new[row_nan_count <= threshold]
    df_new.reset_index(drop=True, inplace=True)

    # Drop the highly correlated columns from the DataFrame
    df_new.drop(columns=columns_to_drop, inplace=True)
    df_new.reset_index(drop=True, inplace=True)

    # Convert CAMEO_DEUG_2015 from strings to numeric values
    df_new['CAMEO_DEUG_2015'] = pd.to_numeric(df_new['CAMEO_DEUG_2015'], errors='coerce')

    # Manual Encoding of OST_WEST_KZ
    df_new['OST_WEST_KZ'] = df_new['OST_WEST_KZ'].map({'W': 1, 'O': 0, np.nan: np.nan})

    # OneHotEncoding of multi-level categoricals
    df_new = pd.get_dummies(df_new, columns=encoded_columns)

    # Define mappings for decade and movement
    decade_mapping = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 3, 8: 4, 9: 4, 10: 5, 11: 5, 12: 6, 13: 6, 14: 7, 15: 7}
    movement_mapping = {1: 1, 2: 0, 3: 1, 4: 0, 5: 1, 6: 0, 7: 0, 8: 1, 9: 1, 10: 0, 11: 0, 12: 1, 13: 1, 14: 0, 15: 0}

    # Apply the mappings directly to create new columns
    df_new['PRAEGENDE_JUGENDJAHRE__DECADE'] = df_new['PRAEGENDE_JUGENDJAHRE'].replace(decade_mapping)
    df_new['PRAEGENDE_JUGENDJAHRE__MOVEMENT'] = df_new['PRAEGENDE_JUGENDJAHRE'].replace(movement_mapping)

    # Drop the original 'PRAEGENDE_JUGENDJAHRE' feature
    df_new.drop(columns=['PRAEGENDE_JUGENDJAHRE'], inplace=True)

    # Change feature "CAMEO_INTL_2015" to integers first
    df_new['CAMEO_INTL_2015'] = pd.to_numeric(df_new['CAMEO_INTL_2015'], errors='coerce')

    # Create a new column 'WEALTH' by extracting the 'tens'-place digit
    df_new['CAMEO_INTL_2015__WEALTH'] = df_new['CAMEO_INTL_2015'] // 10

    # Create a new column 'LIFE_STAGE' by extracting the 'ones'-place digit
    df_new['CAMEO_INTL_2015__LIFE_STAGE'] = df_new['CAMEO_INTL_2015'] % 10

    # Drop the original 'CAMEO_INTL_2015' feature
    df_new.drop(columns=['CAMEO_INTL_2015'], inplace=True)

    # OHE PLZ8_BAUMAX
    df_new = pd.get_dummies(df_new, columns=['PLZ8_BAUMAX'])

    quality_categories = [1, 2, 3, 4, 5]
    for category in quality_categories:
        df_new[f'QUALITY_{category}'] = (df_new['WOHNLAGE'] == category).astype(int)

    # Create a binary feature for rural vs. urban buildings
    df_new['RURAL'] = ((df_new['WOHNLAGE'] == 7) | (df_new['WOHNLAGE'] == 8)).astype(int)

    # Drop the original 'WOHNLAGE' feature
    df_new.drop(columns=['WOHNLAGE'], inplace=True)

    # Drop remaining mixed features
    df_new.drop(columns=mixed_feats, inplace=True)
    df_new.reset_index(drop=True, inplace=True)
    # Return the cleaned dataframe.
    return df_new

Step 2: Feature Transformation¶

Step 2.1: Apply Feature Scaling¶

Before we apply dimensionality reduction techniques to the data, we need to perform feature scaling so that the principal component vectors are not influenced by the natural differences in scale for features.

  • sklearn requires that data not have missing values in order for its estimators to work properly. So, before applying the scaler to our data, we need to make sure that the DataFrame is cleaned of the remaining missing values. This can be as simple as removing all data points with missing data, or applying an Imputer to replace all missing values. We might also try a more complicated procedure, temporarily removing missing values in order to compute the scaling parameters before re-introducing those missing values and applying imputation. Think about how much missing data we have and what possible effects each approach might have on the analysis; we justify our decision in the discussion section below.
  • For the actual scaling function, a StandardScaler instance is suggested, scaling each feature to mean 0 and standard deviation 1.
  • For these classes, we can make use of the .fit_transform() method to both fit a procedure to the data as well as apply the transformation to the data at the same time. Don't forget to keep the fit sklearn objects handy, since we'll be applying them to the customer demographics data towards the end of the project.
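The fit-once, transform-later pattern from the last bullet can be sketched on toy arrays (illustrative values only, not the project data):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

general = np.array([[1.0, 10.0], [np.nan, 30.0], [3.0, 20.0]])  # "general population"
other   = np.array([[2.0, np.nan]])                             # "customers"

imputer = SimpleImputer(strategy='mean')
scaler = StandardScaler()

# Fit on the general data and keep the fitted objects around
general_scaled = scaler.fit_transform(imputer.fit_transform(general))

# Later: only transform() the second dataset with the SAME fitted objects,
# so the column means/stds learned from the general data are reused
other_scaled = scaler.transform(imputer.transform(other))
print(other_scaled)  # [[0. 0.]] — both values land exactly on the general means
```

Here the imputed value (column mean 20) and the raw value 2 both equal the general-population column means, so they scale to zero; the point is that no statistics are re-estimated from the second dataset.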
In [ ]:
# Create an instance of SimpleImputer with your mean strategy
imputer = SimpleImputer(strategy='mean')

azdias_imputed = imputer.fit_transform(azdias_below_thresh)

azdias_imputed = pd.DataFrame(azdias_imputed, columns=azdias_below_thresh.columns)

# Calculate NaN counts before and after imputation
nan_count_before = azdias_below_thresh.isna().sum().sum()
nan_count_after = azdias_imputed.isna().sum().sum()

print(f'NaN Count before imputation = {nan_count_before}\n')
print(f'NaN Count after imputation = {nan_count_after}')
NaN Count before imputation = 761265

NaN Count after imputation = 0
In [ ]:
# Apply feature scaling to the general population demographics data.
scaler = StandardScaler()

final_azdias = pd.DataFrame(scaler.fit_transform(azdias_imputed), columns=azdias_imputed.columns)

Discussion 2.1: Apply Feature Scaling¶

We kept all rows and applied mean imputation to the missing values; after that, a StandardScaler was fitted and applied to the whole dataset so that PCA can work properly.

Step 2.2: Perform Dimensionality Reduction¶

On our scaled data, we are now ready to apply dimensionality reduction techniques.

  • Use sklearn's PCA class to apply principal component analysis on the data, thus finding the vectors of maximal variance in the data. To start, we should not set any parameters (so all components are computed) or set a number of components that is at least half the number of features (so there's enough features to see the general trend in variability).
  • Check out the ratio of variance explained by each principal component as well as the cumulative variance explained. Try plotting the cumulative or sequential values using matplotlib's plot() function. Based on what we find, we select a value for the number of transformed features we'll retain for the clustering part of the project.
  • Once we've made a choice for the number of components to keep, make sure we re-fit a PCA instance to perform the decided-on transformation.
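One common way to read the component count off the explained-variance ratios, shown here on a stand-in array rather than the fitted `pca` object (in practice you would use `pca.explained_variance_ratio_`):

```python
import numpy as np

# Stand-in explained-variance ratios for a 5-feature example
evr = np.array([0.5, 0.25, 0.15, 0.06, 0.04])

cumulative = np.cumsum(evr)
# Index of the first component at which the cumulative variance crosses 95%
n_components = int(np.argmax(cumulative >= 0.95)) + 1
print(n_components)  # 4 components reach 95% cumulative variance here
```

Equivalently, passing a float like `PCA(n_components=0.95)` tells sklearn to perform this selection itself, which is what we do below.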
In [ ]:
def scree_plot(pca):
    explained_variance = pca.explained_variance_ratio_
    cumulative_variance = explained_variance.cumsum()

    plt.figure(figsize=(10, 6))
    plt.bar(range(1, len(explained_variance) + 1), explained_variance, label='Individual Component Variance', alpha=0.7, color='skyblue')
    plt.plot(range(1, len(explained_variance) + 1), cumulative_variance, label='Cumulative Variance', marker='o', markersize=2, linestyle='-', color='orange')
    plt.xlabel('Principal Component Number')
    plt.ylabel('Variance Explained (%)')
    plt.title('Scree Plot: Explained Variance by Principal Components')
    plt.legend()
    plt.grid(True, linestyle='--', alpha=0.6)
    plt.show()
In [ ]:
# Apply PCA to the data.
pca = PCA()
pca.fit(final_azdias)
Out[ ]:
PCA()
In [ ]:
# Investigate the variance accounted for by each principal component.
scree_plot(pca)

In this specific scree plot, we observe that the cumulative explained variance reaches 100% (total variance) at around 175 principal components. However, to reach approximately 95% explained variance, we only need roughly 110 to 130 components.

Selecting this reduced set of principal components, while still explaining 95% of the data's variance, offers a significant reduction in dimensionality. This reduction streamlines subsequent analyses and modeling while retaining the essential information present in the data.

So, we will apply PCA again with 0.95 as the n_components argument and compare the resulting number of components to our visualisation.

In [ ]:
print(f'Number of features before PCA: {final_azdias.shape[1]}')

# Apply PCA to achieve 95% variance
pca = PCA(n_components=0.95)
pca.fit(final_azdias)

# Print the number of components for PCA while achieving 95% Variance
print(f'Number of components for PCA while achieving 95% Variance: {pca.n_components_}')
Number of features before PCA: 182
Number of components for PCA while achieving 95% Variance: 121
In [ ]:
# Transform DataFrame using PCA
final_azdias_with_pca = pca.transform(final_azdias)
In [ ]:
# Visualise the variance accounted for by each principal component with 95% variance.
scree_plot(pca)

Discussion 2.2: Perform Dimensionality Reduction¶

We first fitted PCA without setting any arguments and found that, in order to keep 95% of the variance, the number of components needed is somewhere between 110 and 130. After running PCA again with n_components=0.95, the resulting number of principal components is 121.

We managed to reduce the feature count from the original 182 down to 121.

Step 2.3: Interpret Principal Components¶

Now that we have our transformed principal components, it's a nice idea to check out the weight of each variable on the first few components to see if they can be interpreted in some fashion.

As a reminder, each principal component is a unit vector that points in the direction of highest variance (after accounting for the variance captured by earlier principal components). The further a weight is from zero, the more the principal component is in the direction of the corresponding feature. If two features have large weights of the same sign (both positive or both negative), then increases in one can be expected to be associated with increases in the other. By contrast, features with opposite signs can be expected to show a negative correlation: increases in one variable should be accompanied by a decrease in the other.

  • To investigate the features, we map each weight to its corresponding feature name, then sort the features by weight. The most interesting features for each principal component will then be those at the beginning and end of the sorted list. Use the data dictionary document to understand these most prominent features, their relationships, and what a positive or negative value on the principal component might indicate.
  • We investigate and interpret feature associations from the first three principal components in this substep. To facilitate this, we write a function that can be called at any time to print the sorted list of feature weights for the i-th principal component. This may come in handy in the next step of the project, when we interpret the tendencies of the discovered clusters.
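A compact sketch of such a function, demonstrated on a tiny synthetic dataset (the feature names `a`–`d` are placeholders, not project features):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 4)), columns=['a', 'b', 'c', 'd'])
X['d'] = X['a'] * 2  # inject correlation so the first component is interpretable

pca = PCA().fit(X)

def component_weights(pca, feature_names, i):
    """Return feature weights for the i-th principal component, sorted descending."""
    return pd.Series(pca.components_[i], index=feature_names).sort_values(ascending=False)

w = component_weights(pca, X.columns, 0)
print(w.head(2))  # largest positive weights
print(w.tail(2))  # largest negative weights
```

Because `a` and `d` are perfectly correlated, their weights on the first component share the same sign, which is exactly the reading of "same sign implies positive association" described above.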
In [ ]:
def plot_component(component_weight, feature_names, title):
    """
    Plot the top features for a single principal component with Seaborn styling.

    Args:
        component_weight (array-like): Weights for the principal component.
        feature_names (list): List of feature names.
        title (string): Title of the plot.

    Returns:
        None
    """
    # Sort features by their absolute weights in descending order
    sorted_features = sorted(zip(feature_names, component_weight), key=lambda x: abs(x[1]), reverse=True)
    top_features = sorted_features[:10]

    # Extract feature names and corresponding weights
    feature_names, weights = zip(*top_features)
    
    # Create a bar plot
    plt.figure(figsize=(10, 6))
    color = ['skyblue' if w >= 0 else 'lightcoral' for w in weights]  # Positive weights in blue, negative in red
    plt.barh(np.arange(len(feature_names)), weights, color=color)
    plt.yticks(np.arange(len(feature_names)), feature_names)
    plt.xlabel('Weight')
    plt.title(title)
    plt.gca().invert_yaxis()
    plt.grid(axis='x', linestyle='--', alpha=0.6)
    plt.rcParams['axes.edgecolor']='#333F4B'
    plt.rcParams['axes.linewidth']=0.8
    plt.rcParams['xtick.color']='#333F4B'
    plt.rcParams['ytick.color']='#333F4B'
    plt.tight_layout()
    plt.show()
    
    # Create a DataFrame to display the top features and their weights
    top_features_df = pd.DataFrame({'Feature': feature_names, 'Weight': weights})
    display(top_features_df)
In [ ]:
# Map weights for the first principal component to corresponding feature names
# and then print the linked values, sorted by weight.
feature_names = final_azdias.columns.tolist()
component_weights = pca.components_

plot_component(component_weights[0], feature_names, 'First Principal Component Weights')
Feature Weight
0 MOBI_REGIO -0.205457
1 PLZ8_ANTG3 0.194958
2 PLZ8_ANTG1 -0.194710
3 KBA05_ANTG1 -0.191092
4 PLZ8_ANTG4 0.188818
5 CAMEO_INTL_2015__WEALTH 0.187478
6 PLZ8_BAUMAX_1.0 -0.183475
7 FINANZ_MINIMALIST -0.182912
8 KBA05_GBZ -0.182760
9 HH_EINKOMMEN_SCORE 0.176077
In [ ]:
# Map weights for the second principal component to corresponding feature names
# and then print the linked values, sorted by weight.
plot_component(component_weights[1], feature_names, 'Second Principal Component Weights')
Feature Weight
0 PRAEGENDE_JUGENDJAHRE__DECADE -0.239654
1 ALTERSKATEGORIE_GROB 0.231984
2 FINANZ_SPARER -0.228029
3 FINANZ_VORSORGER 0.219796
4 FINANZ_UNAUFFAELLIGER -0.217578
5 SEMIO_REL -0.213131
6 SEMIO_TRADV -0.207335
7 SEMIO_PFLICHT -0.204418
8 FINANZ_ANLEGER -0.202930
9 ZABEOTYP_3 0.197914
In [ ]:
# Map weights for the third principal component to corresponding feature names
# and then print the linked values, sorted by weight.
plot_component(component_weights[2], feature_names, 'Third Principal Component Weights')
Feature Weight
0 ANREDE_KZ -0.352909
1 SEMIO_VERT 0.327128
2 SEMIO_KAEM -0.325363
3 SEMIO_DOM -0.293393
4 SEMIO_KRIT -0.272109
5 SEMIO_FAM 0.262422
6 SEMIO_SOZ 0.258469
7 SEMIO_KULT 0.254445
8 SEMIO_ERL -0.208117
9 SEMIO_RAT -0.162015

Discussion 2.3: Interpret Principal Components¶

  • For the first principal component we have the following:

    • Positive correlation among HH_EINKOMMEN_SCORE, CAMEO_INTL_2015__WEALTH, PLZ8_ANTG4, PLZ8_ANTG3
    • Negative correlation among KBA05_GBZ, FINANZ_MINIMALIST, PLZ8_BAUMAX_1.0, KBA05_ANTG1, PLZ8_ANTG1, MOBI_REGIO
  • For the second principal component we have the following:

    • Positive correlation among ALTERSKATEGORIE_GROB, FINANZ_VORSORGER, ZABEOTYP_3
    • Negative correlation among FINANZ_SPARER, FINANZ_UNAUFFAELLIGER, SEMIO_REL, SEMIO_TRADV, SEMIO_PFLICHT, FINANZ_ANLEGER, PRAEGENDE_JUGENDJAHRE__DECADE
  • For the third principal component we have the following:

    • Positive correlation among SEMIO_VERT, SEMIO_FAM, SEMIO_SOZ, SEMIO_KULT
    • Negative correlation among SEMIO_RAT, SEMIO_ERL, SEMIO_KRIT, SEMIO_DOM, SEMIO_KAEM, ANREDE_KZ

Step 3: Clustering¶

Step 3.1: Apply Clustering to General Population¶

We've assessed and cleaned the demographics data, then scaled and transformed them. Now, it's time to see how the data clusters in the principal components space. In this substep, we will apply k-means clustering to the dataset and use the average within-cluster distances from each point to their assigned cluster's centroid to decide on a number of clusters to keep.

  • Use sklearn's KMeans class to perform k-means clustering on the PCA-transformed data.
  • Then, compute the average difference from each point to its assigned cluster's center.
  • Perform the above two steps for a number of different cluster counts. You can then see how the average distance decreases with an increasing number of clusters. However, each additional cluster provides a smaller net benefit.
  • Once we've selected a final number of clusters to use, re-fit a KMeans instance to perform the clustering operation. Make sure that you also obtain the cluster assignments for the general demographics data, since we'll be using them in the final Step 3.3.
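As a sketch of the distance computation on random data: `inertia_` is the sum of squared distances to the assigned centroids, so the average Euclidean distance mentioned above has to be computed explicitly from the labels and centers. `n_init=10` is set explicitly here for compatibility across sklearn versions ('auto' requires sklearn 1.2+).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))  # stand-in for the PCA-transformed data

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Euclidean distance from each point to its assigned cluster centroid
diffs = X - kmeans.cluster_centers_[kmeans.labels_]
avg_distance = np.sqrt((diffs ** 2).sum(axis=1)).mean()

# Inertia is the SUM of SQUARED distances, so the two are related but not equal
print(avg_distance)
print(kmeans.inertia_ / len(X))  # mean squared distance per point
```

Either quantity works for an elbow plot, since both decrease monotonically in expectation as k grows; the notebook below uses inertia directly.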
In [ ]:
centers = range(10,31)
inertia_scores = []


for center in centers:
    # run k-means clustering on the data.
    kmeans = KMeans(n_clusters=center, n_init='auto')
    kmeans.fit(final_azdias_with_pca)
    
    # Add the inertia values
    inertia_scores.append(kmeans.inertia_)
In [ ]:
# Investigate the change in within-cluster distance across number of clusters.
plt.plot(centers, inertia_scores, linestyle='-', marker='o', color='b')
plt.title("Elbow Method for Optimal Number of Clusters - [Inertia vs k]", fontsize=16)
plt.xlabel("Number of Clusters (k)", fontsize=12)
plt.ylabel("Inertia", fontsize=12)
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
plt.tight_layout()
plt.show()
In [ ]:
# Re-fit the k-means model with the selected number of clusters and obtain
# cluster predictions for the general population demographics data.
kmeans_optimal = KMeans(n_clusters=25, n_init='auto')
kmeans_optimal.fit(final_azdias_with_pca)

azdias_preds = kmeans_optimal.predict(final_azdias_with_pca)

Discussion 3.1: Apply Clustering to General Population¶

We applied k-means with the number of clusters ranging from 10 to 30 and picked the best value using the elbow method, which tracks the inertia score of the fitted data.

We then re-fit k-means with the chosen number of clusters to continue with for the rest of the project.

Although it is not a perfect elbow curve, the graph shows the inertia decreasing more slowly near k=25, which is the value we chose.

Step 3.2: Apply All Steps to the Customer Data¶

Now that we have clusters and cluster centers for the general population, it's time to see how the customer data maps onto those clusters. Take care not to confuse this with re-fitting all of the models to the customer data. Instead, we're going to use the fits from the general population to clean, transform, and cluster the customer data. In the last step of the project, we will interpret how the general population fits apply to the customer data.

  • Don't forget when loading in the customers data, that it is semicolon (;) delimited.
  • Apply the same feature wrangling, selection, and engineering steps to the customer demographics using the clean_data() function we created earlier.
  • Use the sklearn objects fitted on the general demographics data, and apply their transformations to the customers data. That is, we should not use a .fit() or .fit_transform() method to re-fit the old objects, nor create new sklearn objects. Carry the data through the feature scaling, PCA, and clustering steps, obtaining cluster assignments for all of the data in the customer demographics data.
In [ ]:
# Load in the customer demographics data.
customers = pd.read_csv('dataset/Udacity_CUSTOMERS_Subset.csv', sep=';')

# Apply the same feature wrangling, selection, and engineering steps to the customer demographics.
final_customers = clean_data(customers)
In [ ]:
# Get the column names of each DataFrame
columns_azdias = set(azdias_below_thresh.columns)
columns_customers = set(final_customers.columns)

# Find the columns that exist in df1 but not in df2
columns_only_in_azdias = columns_azdias - columns_customers

# Find the columns that exist in df2 but not in df1
columns_only_in_customers = columns_customers - columns_azdias

if not columns_only_in_azdias and not columns_only_in_customers:
    print("Both DataFrames have the same columns.")
else:
    if columns_only_in_azdias:
        print(f"Columns only in azdias and not in customers: {columns_only_in_azdias}")
    if columns_only_in_customers:
        print(f"Columns only in customers and not in azdias: {columns_only_in_customers}")
Columns only in azdias and not in customers: {'GEBAEUDETYP_5.0'}
In [ ]:
# Add missing column
final_customers['GEBAEUDETYP_5.0'] = 0

# Reorder the columns after adding the new column
final_customers = final_customers[azdias_below_thresh.columns]
In [ ]:
# Apply imputer to fill missing values in 'final_customers'
customers_imputed = pd.DataFrame(imputer.transform(final_customers), columns=final_customers.columns)

# Scale the imputed data
customers_scaled = pd.DataFrame(scaler.transform(customers_imputed), columns=customers_imputed.columns)

# Apply PCA transformation
customers_with_pca = pca.transform(customers_scaled)

# Make predictions
customers_preds = kmeans_optimal.predict(customers_with_pca)

Step 3.3: Compare Customer Data to Demographics Data¶

At this point, we have clustered data based on demographics of the general population of Germany, and seen how the customer data for a mail-order sales company maps onto those demographic clusters. In this final substep, we will compare the two cluster distributions to see where the strongest customer base for the company is.

Consider the proportion of persons in each cluster for the general population, and the proportions for the customers. If we think the company's customer base is universal, then the cluster assignment proportions should be fairly similar between the two. If only particular segments of the population are interested in the company's products, then we should see a mismatch from one to the other. If there is a higher proportion of persons in a cluster for the customer data compared to the general population (e.g. 5% of persons are assigned to a cluster for the general population, but 15% of the customer data is closest to that cluster's centroid), that suggests the people in that cluster are a target audience for the company. On the other hand, a cluster whose proportion is larger in the general population than in the customer data (e.g. only 2% of customers are closest to a population centroid that captures 6% of the data) suggests that group of persons falls outside of the target demographics.

Take a look at the following points in this step:

  • Compute the proportion of data points in each cluster for the general population and the customer data. Visualizations will be useful here: both for the individual dataset proportions, but also to visualize the ratios in cluster representation between groups. Seaborn's countplot() or barplot() function could be handy.
    • Recall the analysis you performed in step 1.1.3 of the project, where you separated out certain data points from the dataset if they had more than a specified threshold of missing values. If you found that this group was qualitatively different from the main bulk of the data, you should treat this as an additional data cluster in this analysis. Make sure that you account for the number of data points in this subset, for both the general population and customer datasets, when making your computations!
  • Which cluster or clusters are overrepresented in the customer dataset compared to the general population? Select at least one such cluster and infer what kind of people might be represented by that cluster. Use the principal component interpretations from step 2.3 or look at additional components to help you make this inference. Alternatively, you can use the .inverse_transform() method of the PCA and StandardScaler objects to transform centroids back to the original data space and interpret the retrieved values directly.
  • Perform a similar investigation for the underrepresented clusters. Which cluster or clusters are underrepresented in the customer dataset compared to the general population, and what kinds of people are typified by these clusters?
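The comparison described in the bullets above can be sketched as a small helper that turns the two label arrays into per-cluster shares plus a customer-to-population ratio, where a ratio above 1 marks an overrepresented cluster. `cluster_representation` is a hypothetical name, and the label arrays are assumed to be the 1-D outputs of `kmeans_optimal.predict`:

```python
import numpy as np
import pandas as pd

def cluster_representation(pop_labels, cust_labels):
    """Per-cluster shares in each dataset plus the customer/population ratio."""
    n_clusters = int(max(pop_labels.max(), cust_labels.max())) + 1
    pop_share = np.bincount(pop_labels, minlength=n_clusters) / len(pop_labels)
    cust_share = np.bincount(cust_labels, minlength=n_clusters) / len(cust_labels)
    return pd.DataFrame({
        'pop_share': pop_share,
        'cust_share': cust_share,
        'ratio': cust_share / pop_share,  # > 1 means overrepresented among customers
    })

# e.g. cluster_representation(azdias_preds, customers_preds).sort_values('ratio')
```

Sorting by the ratio column immediately surfaces the most under- and overrepresented clusters at the two ends of the table.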
In [ ]:
# Compare the proportion of data in each cluster for the customer data to the
# proportion of data in each cluster for the general population.

# Determine the number of clusters
num_clusters = max(azdias_preds.max(), customers_preds.max()) + 1

# Convert cluster counts to proportions: the two datasets differ greatly in
# size (~891k vs ~192k rows), so raw counts are not directly comparable
azdias_props = np.bincount(azdias_preds, minlength=num_clusters) / len(azdias_preds)
customers_props = np.bincount(customers_preds, minlength=num_clusters) / len(customers_preds)

# Create an array of cluster indices
cluster_indices = np.arange(num_clusters)
bar_width = 0.35
fig, ax = plt.subplots(figsize=(10, 6))
bars1 = ax.bar(cluster_indices - bar_width/2, azdias_props, bar_width, label='General Population', alpha=0.7, color='b')
bars2 = ax.bar(cluster_indices + bar_width/2, customers_props, bar_width, label='Customers', alpha=0.7, color='g')
ax.set_xlabel('Cluster')
ax.set_ylabel('Proportion of data points')
ax.set_title('Comparison of K-Means Cluster Proportions: Customer Data vs. General Population')
ax.set_xticks(cluster_indices)
ax.legend()
plt.tight_layout()
plt.show()
In [ ]:
# List of important features to be inspected
important_features = ['ALTERSKATEGORIE_GROB', 'ANREDE_KZ', 'HH_EINKOMMEN_SCORE', 'RURAL', 'QUALITY_1', 
                      'QUALITY_2', 'QUALITY_3', 'QUALITY_4', 'QUALITY_5', 'CAMEO_INTL_2015__WEALTH', 
                      'CAMEO_INTL_2015__LIFE_STAGE', 'PLZ8_BAUMAX_1.0', 'PLZ8_BAUMAX_2.0', 'PLZ8_BAUMAX_3.0', 
                      'PLZ8_BAUMAX_4.0', 'PLZ8_BAUMAX_5.0', 'REGIOTYP']
In [ ]:
# What kinds of people are part of a cluster that is overrepresented in the
# customer data compared to the general population?
cluster_10 = kmeans_optimal.cluster_centers_[10]
over_pca_reversed = pca.inverse_transform(cluster_10).reshape(1, -1)
over_scaled_reversed = scaler.inverse_transform(over_pca_reversed).reshape(-1)
over_result = pd.DataFrame(over_scaled_reversed, columns=['Features'], index=final_customers.columns)
print(over_result.loc[important_features, 'Features'])
ALTERSKATEGORIE_GROB           1.869091
ANREDE_KZ                      1.965022
HH_EINKOMMEN_SCORE             4.118406
RURAL                          0.474269
QUALITY_1                      0.047516
QUALITY_2                      0.112978
QUALITY_3                      0.235639
QUALITY_4                      0.082051
QUALITY_5                      0.041367
CAMEO_INTL_2015__WEALTH        2.366753
CAMEO_INTL_2015__LIFE_STAGE    3.359207
PLZ8_BAUMAX_1.0                0.953657
PLZ8_BAUMAX_2.0                0.018305
PLZ8_BAUMAX_3.0                0.002127
PLZ8_BAUMAX_4.0               -0.000844
PLZ8_BAUMAX_5.0                0.001536
REGIOTYP                       4.259116
Name: Features, dtype: float64
In [ ]:
# What kinds of people are part of a cluster that is underrepresented in the
# customer data compared to the general population?
cluster_12 = kmeans_optimal.cluster_centers_[12]
under_pca_reversed = pca.inverse_transform(cluster_12).reshape(1, -1)
under_scaled_reversed = scaler.inverse_transform(under_pca_reversed).reshape(-1)
under_result = pd.DataFrame(under_scaled_reversed, columns=['Features'], index=final_customers.columns)
print(under_result.loc[important_features, 'Features'])
ALTERSKATEGORIE_GROB           1.831009
ANREDE_KZ                      0.953732
HH_EINKOMMEN_SCORE             5.581475
RURAL                          0.026054
QUALITY_1                      0.020037
QUALITY_2                      0.088182
QUALITY_3                      0.369533
QUALITY_4                      0.303654
QUALITY_5                      0.187990
CAMEO_INTL_2015__WEALTH        4.530094
CAMEO_INTL_2015__LIFE_STAGE    1.783870
PLZ8_BAUMAX_1.0                0.201109
PLZ8_BAUMAX_2.0                0.141738
PLZ8_BAUMAX_3.0                0.144613
PLZ8_BAUMAX_4.0                0.177179
PLZ8_BAUMAX_5.0                0.311444
REGIOTYP                       4.974027
Name: Features, dtype: float64

Discussion 3.3: Compare Customer Data to Demographics Data¶

From the figure, cluster 10 is relatively overrepresented in the customer data compared to the demographics data, while cluster 12 is relatively underrepresented.

Here are some remarks and findings regarding the clustered data:

  • From cluster 10, we can infer that customers with the following features are the main target audience for the mail-order sales:

    • Female customers (ANREDE_KZ=1.97) between 30 and 45 years old (ALTERSKATEGORIE_GROB=1.87)
    • Average household net income (HH_EINKOMMEN_SCORE=4.12), middle class (REGIOTYP=4.26)
    • Families living in an average neighborhood (QUALITY_3 has the highest weight among the quality indicators)
  • From cluster 12, we can infer that customers with the following features are not necessarily the target audience for the mail-order sales:

    • Male customers (ANREDE_KZ=0.95) between 30 and 45 years old (ALTERSKATEGORIE_GROB=1.83)
    • Low household net income (HH_EINKOMMEN_SCORE=5.58), lower-middle class (REGIOTYP=4.97)
    • Less affluent households, pre-family couples and singles (CAMEO_INTL_2015)